{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# COMPSCI 389: Introduction to Machine Learning\n", "# Topic 2.1: Pandas and Data Sets\n", "\n", "This notebook provides a description of how data sets are represented and manipulated using the `pandas` library." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What is pandas?\n", "\n", "Pandas stands for \"PANel DAta,\" an econometric term for data sets. Webpage: [link](https://pandas.pydata.org/docs/index.html).\n", "\n", "It provides two main objects: a **DataFrame** and a **Series**.\n", "\n", "A DataFrame object stores a 2-dimensional table of data, while a Series stores a 1-dimensional vector of data.\n", "\n", "Pandas provides useful functions for working with these objects, including functions for:\n", "1. Loading data sets from files and storing them in DataFrame and/or Series objects.\n", "2. Manipulating DataFrame and Series objects (e.g., adding or removing features).\n", "3. Computing statistics of the data (e.g., the minimum and maximum values of features).\n", "\n", "Pandas has become so common that many other ML libraries in Python are built to be compatible with pandas, as we will see below.\n", "\n", "To install pandas, run the following command in the console or command line:\n", "\n", "> pip install pandas" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example Data Sets\n", "\n", "In the remainder of this notebook we load and inspect a few example data sets for supervised learning." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## GPA Data\n", "\n", "The GPA data set contains data about undergraduate students at the *Universidade Federal do Rio Grande do Sul* (UFRGS) in Brazil.\n", "\n", "**Input**: Scores on 9 entrance exams: \n", "1. Physics\n", "2. Biology\n", "3. History\n", "4. English\n", "5. Geography\n", "6. Literature\n", "7. Portuguese\n", "8. Math\n", "9. 
Chemistry\n", "\n", "**Output**: GPA on a 4.0 scale during the first three semesters at university.\n", " - The GPA can be used for regression (predict the GPA) or classification (predict the GPA range, e.g., whether it is at least 3.0).\n", "\n", "**Data Set Size**: 43,303 students (rows)\n", "\n", "Let's start by loading and displaying this data set. The data set is available here:\n", "\n", "[https://people.cs.umass.edu/~pthomas/courses/COMPSCI_389_Spring2024/GPA.csv](https://people.cs.umass.edu/~pthomas/courses/COMPSCI_389_Spring2024/GPA.csv)\n", "\n", "You can either download it into a directory called `data` next to this .ipynb file and load that local copy, or load it directly from the online posting:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

|  | physics | biology | history | English | geography | literature | Portuguese | math | chemistry | gpa |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 622.60 | 491.56 | 439.93 | 707.64 | 663.65 | 557.09 | 711.37 | 731.31 | 509.80 | 1.33333 |
| 1 | 538.00 | 490.58 | 406.59 | 529.05 | 532.28 | 447.23 | 527.58 | 379.14 | 488.64 | 2.98333 |
| 2 | 455.18 | 440.00 | 570.86 | 417.54 | 453.53 | 425.87 | 475.63 | 476.11 | 407.15 | 1.97333 |
| 3 | 756.91 | 679.62 | 531.28 | 583.63 | 534.42 | 521.40 | 592.41 | 783.76 | 588.26 | 2.53333 |
| 4 | 584.54 | 649.84 | 637.43 | 609.06 | 670.46 | 515.38 | 572.52 | 581.25 | 529.04 | 1.58667 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 43298 | 519.55 | 622.20 | 660.90 | 543.48 | 643.05 | 579.90 | 584.80 | 581.25 | 573.92 | 2.76333 |
| 43299 | 816.39 | 851.95 | 732.39 | 621.63 | 810.68 | 666.79 | 705.22 | 781.01 | 831.76 | 3.81667 |
| 43300 | 798.75 | 817.58 | 731.98 | 648.42 | 751.30 | 648.67 | 662.05 | 773.15 | 835.25 | 3.75000 |
| 43301 | 527.66 | 443.82 | 545.88 | 624.18 | 420.25 | 676.80 | 583.41 | 395.46 | 509.80 | 2.50000 |
| 43302 | 512.56 | 415.41 | 517.36 | 532.37 | 592.30 | 382.20 | 538.35 | 448.02 | 496.39 | 3.16667 |

43303 rows × 10 columns
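
A table like the one above is what `pandas.read_csv` returns when it loads the GPA file. The following is a minimal sketch, not necessarily the code this notebook ran. To stay runnable offline, it embeds only the first three rows of the table as CSV text. The variable names `sample_csv`, `df`, `X`, and `y` are illustrative assumptions, as is the assumption that the CSV header matches the column names displayed above. For the full data set, the same `read_csv` call could be pointed at the URL given earlier or at a local copy such as `data/GPA.csv`.

```python
import io

import pandas as pd

# Assumption: the CSV header matches the column names displayed in the table.
# Only the first three displayed rows are embedded here so the sketch runs offline.
sample_csv = io.StringIO(
    "physics,biology,history,English,geography,literature,Portuguese,math,chemistry,gpa\n"
    "622.60,491.56,439.93,707.64,663.65,557.09,711.37,731.31,509.80,1.33333\n"
    "538.00,490.58,406.59,529.05,532.28,447.23,527.58,379.14,488.64,2.98333\n"
    "455.18,440.00,570.86,417.54,453.53,425.87,475.63,476.11,407.15,1.97333\n"
)

# read_csv accepts a file path, a URL, or a file-like object, and returns a DataFrame.
df = pd.read_csv(sample_csv)

# Split into inputs (the 9 exam scores) and output (the GPA).
X = df.drop(columns="gpa")  # DataFrame: one column per entrance exam
y = df["gpa"]               # Series: one GPA value per student

print(df.shape)  # (3, 10) for this three-row sample
print(X.shape)   # (3, 9)
```

Dropping the `gpa` column leaves a 9-column DataFrame of inputs, while selecting a single column yields a Series, which mirrors the DataFrame/Series distinction described at the start of the notebook.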

|  | physics | biology | history | English | geography | literature | Portuguese | math | chemistry |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 622.60 | 491.56 | 439.93 | 707.64 | 663.65 | 557.09 | 711.37 | 731.31 | 509.80 |
| 1 | 538.00 | 490.58 | 406.59 | 529.05 | 532.28 | 447.23 | 527.58 | 379.14 | 488.64 |
| 2 | 455.18 | 440.00 | 570.86 | 417.54 | 453.53 | 425.87 | 475.63 | 476.11 | 407.15 |
| 3 | 756.91 | 679.62 | 531.28 | 583.63 | 534.42 | 521.40 | 592.41 | 783.76 | 588.26 |
| 4 | 584.54 | 649.84 | 637.43 | 609.06 | 670.46 | 515.38 | 572.52 | 581.25 | 529.04 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 43298 | 519.55 | 622.20 | 660.90 | 543.48 | 643.05 | 579.90 | 584.80 | 581.25 | 573.92 |
| 43299 | 816.39 | 851.95 | 732.39 | 621.63 | 810.68 | 666.79 | 705.22 | 781.01 | 831.76 |
| 43300 | 798.75 | 817.58 | 731.98 | 648.42 | 751.30 | 648.67 | 662.05 | 773.15 | 835.25 |
| 43301 | 527.66 | 443.82 | 545.88 | 624.18 | 420.25 | 676.80 | 583.41 | 395.46 | 509.80 |
| 43302 | 512.56 | 415.41 | 517.36 | 532.37 | 592.30 | 382.20 | 538.35 | 448.02 | 496.39 |

43303 rows × 9 columns